AI Site Reliability Engineer

Location: Singapore
Type: Contract
Post Date: Mon May 11 08:27:58 2026
Ref: 41901

What’s on Offer:

Industry: Consulting
Location: Singapore
12-month contract role (with the possibility of extension)
Competitive Compensation

Job Summary:

We are seeking an AI Site Reliability Engineer to join our client’s core AI Platform Engineering team and embed reliability patterns directly into services as they are designed and built. You will also partner with the AI Security Engineer to ensure every system is secure by design, shifting security left into architecture, CI/CD pipelines, and operational practices. This role sits at the heart of the firm’s enterprise AI strategy, supporting mission-critical AI systems used across the organisation.

Job Description:

Own the enterprise AI gateway: Be the accountable owner for the LLM gateway and MCP gateway, covering architecture, SLOs (availability, latency, throughput), capacity planning, incident response, and roadmap.
Work on the SRE squads: Define on-call rotations, escalation paths, quality standards, and vendor performance expectations.
Set the reliability standard: Define and enforce SLOs, error budgets, and error-budget policies across all AI products. When budgets burn, you make the call to freeze features and fix reliability.
Harden the platform as we build it: Work with the core platform engineering squad to embed reliability patterns (circuit breakers, retry policies, graceful degradation, health checks, deployment safety) into services from the first sprint.
Architect for AI-specific failure modes: Design mitigation strategies for non-deterministic outputs, long-tail model latency, agent loops, cascading failures, and LLM provider outages.
Partner on secure-by-design: Work with the AI Security Engineer to embed threat modelling, zero-trust controls, prompt-injection defences, and content-safety guardrails into architecture and operations.
Eliminate toil: Automate incident detection, runbook execution, capacity scaling, deployment pipelines, and onboarding flows. Measure toil and reduce it over time.
Build operational excellence: Establish a blameless postmortem culture, incident command structure, on-call health practices, and operational review cadences.
Raise the engineering bar: Act as SME on production engineering best practices, including testing strategies (chaos, canary, load, red-team), deployment safety (blue-green, progressive rollout), observability standards, and code-review discipline.

Job Requirements:

3–8 years of experience in Site Reliability Engineering, Platform Engineering, or Software Engineering with strong production ownership

Deep expertise in SRE principles including SLOs, SLIs, error budgets, incident management, toil reduction, capacity planning, and chaos engineering

Proven experience managing SRE, platform operations, or production engineering squads, including augmented/vendor teams

Strong track record of owning critical API gateways, platform services, or high-throughput infrastructure with stringent availability targets (99.9%+ uptime)

Hands-on experience supporting AI/ML workloads in production environments, with an understanding of non-deterministic model behaviour, latency variance, token-budget management, agent loops, and LLM provider outages

Strong cloud infrastructure knowledge, preferably AWS, including VPC architecture, EKS/Kubernetes, load balancing, auto-scaling, and multi-AZ/multi-region architecture

Experience with observability platforms such as Datadog, Grafana, OpenTelemetry, PagerDuty, or equivalent

Experience with Infrastructure-as-Code tools such as Terraform or CDK

Experience building CI/CD pipelines using GitHub Actions, ArgoCD, or equivalent

Strong working proficiency in Python for tooling, debugging, and production issue resolution

Nice to Have:

Experience operating enterprise-scale LLM gateways, API gateways, or service meshes such as Kong, Envoy, or AWS API Gateway
Familiarity with MCP (Model Context Protocol) and MCP server fleet operations
Background in AI Security including prompt injection defence, content safety filtering, output grounding, and jailbreak mitigation
Experience with durable execution frameworks such as Temporal or Inngest
Exposure to highly regulated environments such as financial services, banking, or enterprise-scale technology organisations
